Statistical Computing¶
We now start a new module on the interface between statistics and computing. Statistical (or mathematical) concepts need to be implemented as algorithms and data structures, and key issues of accuracy, space, and time need to be considered. Briefly, we look at the following:
- Space complexity
- Time complexity
- Big O notation for space and time complexity
- Computer representation of numbers
  - Integers
  - Floating point
  - Decimal
  - Arbitrary precision
  - Vectors
  - Dense and sparse matrices
  - Tensors
- Accuracy considerations
  - Limits and overflow
  - Round-off errors
  - Catastrophic cancellation
  - Working in log space
  - Condition number
- Linear algebra foundations
  - Many problems in statistics can be formulated as linear algebra problems
  - Vectors and vector spaces
    - Vector spaces and subspaces are closed under addition and scalar multiplication
    - Viewed as data
    - Viewed as mathematical objects
    - Inner products
    - Outer products
    - Projection
    - Vector norms
    - Linear independence
  - Matrices
    - Types and examples
    - Important matrices
      - Symmetric positive definite
      - Orthogonal
    - Span, basis, rank
    - Matrix norms
    - Trace and determinant
    - Eigenvalues and eigenvectors
    - Column space
    - Null space
    - The four fundamental subspaces
  - How to think about matrix-vector multiplication
    - As a linear transform (rotate, reflect, stretch, contract)
    - As a weighted combination of column vectors
  - How to think about matrix-matrix multiplication
    - Row times column
    - Column times row gives a sum of rank-one matrices
    - Upper times upper = upper
    - Orthogonal times orthogonal = orthogonal
  - Matrix factorization
    - Review \(A = LU\)
      - Row pivoting as a numerical consideration
      - Permutation matrix
    - Review \(A = QR\)
      - Gram-Schmidt procedure
      - Column pivoting as a numerical consideration
    - \(A = V \Lambda V^{-1}\)
      - Spectral theorem
      - Geometry
      - Similarity transforms preserve eigenvalues
    - \(A = U \Sigma V^{T}\)
      - SVD
      - Geometry
      - Pseudo-inverse
      - SVD generates bases for the fundamental subspaces
    - Non-negative and sparse matrix factorizations
  - Important linear algebra problems
    - \(Ax = b\) when \(m = n = r\)
    - \(Ax = b\) when \(m > n\)
    - \(Ax = b\) when \(n > m\)
    - \(Ax = b\) when \(A\) is nearly singular
  - General approaches
    - Matrix factorization
    - Iteration
    - Randomization
    - Optimization
  - Application examples
    - Least squares regression
    - Markov chains
    - PCA
    - Graphs
- Optimization foundations
  - Root finding
    - Bisection vs Newton-Raphson
    - Link with optimization
  - Zeroth order methods
  - Second order methods
    - Convexity and Newton’s method
  - First order methods
    - Gradient descent
    - Stochastic gradient descent
    - ADAM and friends
  - Constrained optimization
    - Lagrange multipliers
Big O complexity¶
A function \(f(n)\) has Big O complexity \(g(n)\), written \(f(n) = O(g(n))\), if \(\vert f(n) \vert \le M g(n)\) for some constant \(M\) and all sufficiently large \(n\). For example, \(f(n) = 3n^2 + 5n\) is \(O(n^2)\), since \(3n^2 + 5n \le 4n^2\) for all \(n \ge 5\). Common classes for \(g(n)\), in increasing order of complexity, are
- logarithmic \(\log n\)
- linear \(n\)
- log-linear \(n \log n\)
- polynomial \(n^k\)
- exponential \(e^n\)
- factorial \(n!\)
Notes
Note 1: Parallel processing in most cases gives at best a linear speedup (proportional to the number of processors), so it does not change the complexity class.
Note 2: The constant factor can be important, especially for small to moderate values of \(n\); see the example after the table below.
[4]:
import numpy as np
from functools import reduce
[5]:
def factorial(n):
    """Return n! as an exact integer."""
    # start the product at 1 so that factorial(0) == 1
    return reduce(lambda a, b: a * b, range(1, n+1), 1)
[6]:
for n in (5, 20, 50):
    print('n =', n)
    print('-'*40)
    print('log    ', int(np.log2(n)))
    print('linear ', n)
    print('n log n', int(n*np.log2(n)))
    print('n^2    ', n**2)
    print('n^3    ', n**3)
    print('e^n    ', int(np.exp(n)))
    print('n!     ', factorial(n))
    print()
n = 5
----------------------------------------
log 2
linear 5
n log n 11
n^2 25
n^3 125
e^n 148
n! 120
n = 20
----------------------------------------
log 4
linear 20
n log n 86
n^2 400
n^3 8000
e^n 485165195
n! 2432902008176640000
n = 50
----------------------------------------
log 5
linear 50
n log n 282
n^2 2500
n^3 125000
e^n 5184705528587072045056
n! 30414093201713378043612608166064768844377641568960512000000000000
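To illustrate Note 2 above, here is a small sketch (with made-up operation counts) comparing a hypothetical algorithm that costs \(100n\) operations against one that costs \(n^2\): the algorithm with the better complexity class only wins once \(n\) passes the crossover point at \(n = 100\).

# hypothetical costs: a large constant factor (100n) vs a worse class (n^2)
for n in (10, 100, 1000):
    print('n = %4d  100n = %6d  n^2 = %7d' % (n, 100*n, n**2))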
Example¶
If you have to search a sequence container repeatedly, it is more efficient to first sort it, then use a bisection algorithm.
- The initial sort takes \(n \log n\) time
- Each subsequent search takes \(\log n\)
- Total time for \(k\) searches is \(n \log n + k \log n\), versus \(k \times n/2\) on average for repeated linear scans
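Plugging in the sizes used below (\(n = 10000\), \(k = 1000\)) gives a rough, constant-free operation count for each strategy; this is a sketch of the arithmetic, not a measurement.

import math

# idealized operation counts, ignoring constant factors
n, k = 10_000, 1_000
print('sort then bisect:    ', int(n*math.log2(n) + k*math.log2(n)))  # ~146,000
print('repeated linear scan:', int(k*n/2))                            # 5,000,000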
[7]:
# note that n is still 50 here (left over from the loop above),
# so the test values all lie in [0, 100)
testx = np.random.randint(0, 2*n, 1000)
[8]:
%%time
n = 10000
xs = list(range(n))
hits = 0
for x in testx:
    if x in xs:  # linear scan through the list
        hits += 1
print(hits)
1000
CPU times: user 11.7 ms, sys: 492 µs, total: 12.2 ms
Wall time: 12.1 ms
[9]:
import bisect
[10]:
%%time
n = 10000
xs = list(range(n))
xs.sort()
hits = 0
for x in testx:
    # bisect_left gives the insertion point; x is present only if
    # that position actually holds x
    i = bisect.bisect_left(xs, x)
    if i != n and xs[i] == x:
        hits += 1
print(hits)
1000
CPU times: user 4.69 ms, sys: 1.03 ms, total: 5.73 ms
Wall time: 5.01 ms
Sorting algorithms¶
Generally, use the sort function provided by the language (e.g. sort, sorted). However, sort functions are useful to illustrate algorithmic concepts such as in-place editing, recursion and algorithmic complexity, and you should know how to write simple sort functions. For each sort, ask:
- How much memory does the sort use?
- What is its big O complexity class?
- Is it iterative or recursive? (Note: every recursive algorithm has an iterative equivalent, but some, e.g. quicksort, are easier to think of recursively; see the sketch after the quicksort example below.)
Bubble sort¶
[11]:
def bubblesort(xs):
    """Bubble sort (a variant that compares position i with every later position)."""
    n = len(xs)
    for i in range(n):
        print(xs)  # show the state at the start of each pass
        for j in range(i+1, n):
            if xs[i] > xs[j]:
                xs[i], xs[j] = xs[j], xs[i]
[12]:
xs = [5,1,3,4,2]
bubblesort(xs)
xs
[5, 1, 3, 4, 2]
[1, 5, 3, 4, 2]
[1, 2, 5, 4, 3]
[1, 2, 3, 5, 4]
[1, 2, 3, 4, 5]
[12]:
[1, 2, 3, 4, 5]
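Answering the questions above for this bubble sort: it works in place (constant extra memory), takes \(O(n^2)\) comparisons, and is iterative.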
Selection sort¶
[13]:
def selectionsort(xs):
    """Selection sort."""
    n = len(xs)
    for i in range(n):
        best = xs[i]
        idx = i
        print(xs)  # show the state at the start of each pass
        for j in range(i+1, n):
            if xs[j] < best:
                best = xs[j]
                idx = j
        xs[i], xs[idx] = xs[idx], xs[i]  # swap the smallest remaining element into place
[14]:
xs = [5,1,3,4,2]
selectionsort(xs)
xs
[5, 1, 3, 4, 2]
[1, 5, 3, 4, 2]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
[14]:
[1, 2, 3, 4, 5]
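Selection sort is likewise in place, iterative, and \(O(n^2)\) in comparisons, but it makes at most \(n - 1\) swaps, one per pass.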
Quicksort¶
[15]:
def quicksort(xs):
    """Quicksort (returns a new list rather than sorting in place)."""
    if len(xs) < 2:
        return xs
    else:
        pivot = xs[0]
        left = [x for x in xs[1:] if x <= pivot]
        right = [x for x in xs[1:] if x > pivot]
        print(pivot, left, right)
        return quicksort(left) + [pivot] + quicksort(right)
[16]:
xs = [5,1,3,4,2]
quicksort(xs)
5 [1, 3, 4, 2] []
1 [] [3, 4, 2]
3 [2] [4]
[16]:
[1, 2, 3, 4, 5]
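Quicksort is \(O(n \log n)\) on average and \(O(n^2)\) in the worst case, and the version above is recursive and builds new lists, so it uses \(O(n)\) extra memory. To illustrate the earlier point that every recursive algorithm has an iterative equivalent, here is a minimal sketch of an in-place iterative quicksort (the name iterative_quicksort and the choice of Lomuto partitioning are ours, not part of the notes above):

def iterative_quicksort(xs):
    """Sort xs in place, replacing the call stack with an explicit stack of index ranges."""
    stack = [(0, len(xs) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        # Lomuto partition: move everything <= pivot to the left of position i
        pivot = xs[hi]
        i = lo
        for j in range(lo, hi):
            if xs[j] <= pivot:
                xs[i], xs[j] = xs[j], xs[i]
                i += 1
        xs[i], xs[hi] = xs[hi], xs[i]  # place the pivot in its final position
        # push the two unsorted halves instead of making recursive calls
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return xs

xs = [5, 1, 3, 4, 2]
iterative_quicksort(xs)  # -> [1, 2, 3, 4, 5]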
Memory usage¶
[17]:
import sys
[18]:
xs = np.random.randint(0, 10, (100,100))
[19]:
sys.getsizeof(xs)
[19]:
80112
[20]:
xs.nbytes
[20]:
80000
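The two numbers differ because sys.getsizeof reports the array object including its header (112 bytes here), while nbytes counts only the \(100 \times 100 \times 8\) bytes of data. A related caveat, sketched below with a hypothetical list xs_list: for a Python list, sys.getsizeof counts only the list object and its pointer array, not the elements themselves.

import sys

xs_list = list(range(10000))
print(sys.getsizeof(xs_list))  # the list object and its pointers only
# a rough total that also counts the integer objects the list points to
print(sys.getsizeof(xs_list) + sum(sys.getsizeof(x) for x in xs_list))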
Timing¶
We can time code with the %time, %%time and %timeit magics, or with the timeit module. Note in the examples below that sleep barely uses the CPU, so the CPU times are a few milliseconds while the wall clock time is roughly the full 100 ms.
[21]:
from time import sleep
[22]:
%time sleep(0.1)
CPU times: user 628 µs, sys: 951 µs, total: 1.58 ms
Wall time: 104 ms
[23]:
%%time
sleep(0.1)
CPU times: user 882 µs, sys: 1.26 ms, total: 2.14 ms
Wall time: 103 ms
[24]:
%timeit sleep(0.1)
102 ms ± 368 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
[25]:
%timeit -r 3 -n 10 sleep(0.1)
101 ms ± 142 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
[26]:
from timeit import timeit
[27]:
t = timeit('from time import sleep; sleep(0.1)', number=1)
t
[27]:
0.10093476400000156
[28]:
t = timeit(lambda: sleep(0.1), number=1)
t
[28]:
0.10023614800000047
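When you need more control from the timeit module, its repeat function runs several independent trials; a common convention is to report the minimum, which is least distorted by background load. A minimal sketch:

from time import sleep
from timeit import repeat

# three trials of one call each; repeat returns a list of per-trial times
times = repeat(lambda: sleep(0.1), repeat=3, number=1)
print(min(times))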